Perbandingan Feature Kata dan Frasa dalam Kinerja Clustering Dokumen Teks Berbahasa Indonesia

Translation: Comparison of word and phrase features in the clustering performance of Indonesian-language text documents

Abstract

Text document clustering has been studied intensively because of its important role in text mining and information retrieval. Word-based clustering techniques that use the vector space model routinely suffer from high dimensionality caused by the large number of words. Although extracting words in the preprocessing phase is simple, a collection can be viewed not only as a set of words but also as a set of phrases, some of which span more than one word. Splitting a phrase into its component words can destroy the phrase's actual meaning, so to preserve the context of its words a phrase must be kept as a phrase. The working assumption is that adding phrases to words as clustering features will improve performance. This paper compares word-based and phrase-based clustering. Three clustering models were chosen: hierarchical, partitional, and hybrid. Four similarity techniques (GroupAverage, CompleteLink, SingleLink, and ClusterCenter) were tried for the hierarchical model, K-Means and Bisecting K-Means for the partitional model, and Buckshot for the hybrid model. Document collections of 200 to 800 manually categorized news texts were used to test these algorithms, with the F-measure as the criterion of clustering performance. This value is derived from recall and precision and measures how well an algorithm assigns the documents in a collection to their correct categories. The results show that adding phrases, or simply word pairs, slightly improves clustering performance, although the improvement is not statistically significant.
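The two ideas the abstract relies on, adding word pairs to the word features and scoring a clustering with an F-measure built from recall and precision, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the word pairs are simply adjacent tokens, and the F-measure is the commonly used class-weighted, best-matching-cluster variant, which is an assumption because the abstract does not say which variant was used.

```python
from collections import Counter


def word_and_pair_features(tokens):
    """Count single words plus adjacent word pairs (hypothetical helper).

    Assumption: a "phrase" is approximated by two neighbouring tokens.
    """
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens) + Counter(pairs)


def clustering_f_measure(classes, clusters):
    """Class-weighted F-measure of a clustering against manual categories.

    For every manual class, take the best F score over all clusters,
    then average those best scores weighted by class size.
    Assumption: this variant stands in for the paper's unspecified formula.
    """
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue  # no shared documents, so F is 0 for this pair
            precision = n_ck / n_k  # share of cluster k that belongs to class c
            recall = n_ck / n_c     # share of class c that landed in cluster k
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


if __name__ == "__main__":
    features = word_and_pair_features(["bank", "indonesia", "naik"])
    print(sorted(features))
    print(clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 1]))
```

Keeping the pairs in the same counter as the single words mirrors the abstract's setup, where phrases are added to the word features rather than replacing them.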

Bibliographic details

Authors: Hamzah, Amir; Susanto, Adhi; Soesianto, F; Istyanto, Jazi Eko
Year: 2007
Original format: PDF
Language: Indonesian (ID)

Similar literature

1. Ekstraksi Kata Dasar Secara Berjenjang (Incremental Stemming) Berbasis Aturan Morfologi untuk Teks Berbahasa Indonesia [J]. Wahyu Hidayat. Jurnal Infotel. 2017, No. 2

Translation: Tiered root-word extraction (incremental stemming) based on morphological rules for Indonesian-language text
2. Perbandingan Kinerja Tool Data Mining Weka dan Rapidminer Dalam Algoritma Klasifikasi [J]. Mochammad Faid, Moh Jasri, Titasari Rahmawati. Teknika. 2019, No. 1

Translation: Performance comparison of the Weka and Rapidminer data mining tools in classification algorithms
3. Peningkatan Kinerja Pencarian Dokumen Tugas Akhir Menggunakan Porter Stemmer Bahasa Indonesia dan Fungsi Peringkat Okapi BM25 [J]. Monica Widiasri, Ellysa Tjandra, Lisa Maria Chandra. Teknika. 2017, No. 1

Translation: Improving final-project document search performance using the Indonesian Porter stemmer and the Okapi BM25 ranking function
4. Indonesian Gender Equality Survey Analysis Using Feature Selection Based Clustering [C]. Takako Hashimoto, Kilho Shin, David Lawrence Shepard, International Conference on Awareness Science and Technology. 2020

5. Literatur Review Terhadap Metode, Aplikasi dan Dataset Peringkasan Dokumen Teks Otomatis untuk Teks Berbahasa Indonesia [O]. Yuliska Yuliska, Khairul Umam Syaliman. 2020

Translation: Literature review of methods, applications, and datasets for automatic text document summarization of Indonesian-language text
